








DOCUMENT RESUME 

24 



EM 007 007 



ED 026 856 

By-Osburn, H. G.; Shoemaker, David M.

Pilot Project on Computer Generated Test Items. 

Spons Agency-Office of Education (DHEW), Washington, D.C. Bureau of Research.

Bureau No-BR-6-8533
Pub Date 1 Jun 68
Grant-OEG-1-7-068533-3917
Note-171p.

EDRS Price MF-$0.75 HC-$8.65

Descriptors-*Achievement Tests, Evaluation Techniques, *Measurement Techniques, *Test Construction, Test
Interpretation, Test Selection, Test Validity

A computer program generating question series for achievement examinations
was presented and the relative reliability of computer-generated and
instructor-selected items was investigated. To provide validity for examinations
generated by an original computer program, representative processes of
computer-generated examinations were considered substantially encouraged by the

Summary



This study is based on the concept that it is possible to define 
what a test is measuring by specifying operational procedures for 
the construction and sampling of test items. The implementation of this 
point of view involves definition of meaningful stimulus classes and 
systematic sampling from the classes so defined. This study explores 
the possibilities of using a digital computer for item sampling from 
predefined stimulus classes. 

The primary purpose of the study was to try out the concept of
computer generated test items in the context of an actual course of
instruction to determine the operational feasibility of the technique.
The study consisted of three phases: (1) development of a computer
item generating program, (2) specification of a system of item
forms in the content area of elementary statistics, and (3) tryout of
item sentences sampled from the universe of content using college
students in elementary statistics.

The criteria used to evaluate the computer generated item technique
consisted of (1) the reliability of computer generated items
compared with instructor made items, (2) student reaction to the
technique, and (3) general experience in attempting to generate items
by computer.

The results of the study suggested that the computer generated
test items used in the study were slightly less reliable than
the instructor made items. However, the reliability of the computer
generated items was not unacceptably low. Student reaction to the
technique was generally positive, and reactions became increasingly
favorable as some of the bugs were worked out of the method.

Experience with using a computer to generate test items 
suggested that the method used in this study was quite limited and 
that more flexible data structures will be required. 

Chapter 1




Background and Purpose 
A. Theoretical Background 

An extended discussion of the theoretical background for this study
has recently been published in Educational and Psychological
Measurement (Osburn, 1968); therefore only a condensed version is
offered here. The interested reader is referred to the longer paper.

The basic theoretical concept is that the objective of achievement
testing is generalization to a well defined universe of content. We
are usually not intrinsically interested in an examinee's performance
on the particular items in a test. Rather we would like to make
inferences regarding his knowledge and skills with respect to some larger
content domain. The typical achievement test is an arbitrary collection
of items - of little value unless valid inferences can be made
regarding the examinee's behavior in some wider context.

The usual approach to the measurement of achievement is to think
of the examinee as possessing a measurable amount of "knowledge" where
knowledge has the status of a hypothetical construct mediating behavior
on the test with other important behaviors of the examinee. Knowledge
is conceptualized implicitly as a latent hypothetical continuum and the
measurement problem is reduced to a question of making inferences about
the latent hypothetical continuum from analysis of responses to test
items. Somewhere along the line the hypothetical continuum is given a
name such as number facts, 10th grade mathematics, etc., and the
illusion that something meaningful is being measured is complete.

There are many serious problems with the above described approach
to achievement testing. First and foremost it is very difficult to
establish what an achievement test is measuring in functional terms.
The usual strategy is to attempt to determine the construct validity of
the test. This seems reasonable except that in actual practice it often
comes down to correlating scores on one set of arbitrary items with
a second set of equally arbitrary items where both sets are referred
to the same or similar constructs. This is not to say that achievement
testing is completely arbitrary. Subject matter outlines are
drawn up and the items are often distributed in some systematic fashion
across subject matter elements. However, as a rule items are not
sampled in any rigorous sense and there is not a direct link between
the definition of the universe of content and the items that appear
on any particular test. To establish such a link requires that all
items that could possibly appear on the test be specified in
advance so that random or stratified random sampling can be rigorously
implemented.

The basic strategy of the present study is to attempt to define
what the test is measuring by specifying the operational procedures
for the construction and sampling of test items. Validity is not
established solely by reference to the responses of examinees, but rather
by a careful definition of the stimuli. To paraphrase Hively, Patterson
and Page (1968): classes of stimuli may be defined by stating
sets of relevant and irrelevant properties. Classes of responses may
be defined by stating one or more properties or criteria. Knowledge
may then be operationally defined as a functional relation between
certain classes of stimuli and classes of responses. One can "diagnose"
an individual's knowledge by testing him with samples of stimuli,
varying the stimulus properties systematically, and observing the
occurrence of defined responses.

In principle at least the validity of an achievement test may be
established for a single subject by showing that a functional relationship
exists between classes of stimuli and classes of responses. The
implementation of this point of view involves definition of meaningful
stimulus classes and the systematic sampling from the stimulus
classes so defined. In the author's opinion the definition of stimulus
classes is the principal problem in achievement testing.

One possibility for the systematic sampling of items from a
defined set of stimulus classes is to analyze the content domain into
a hierarchical arrangement of item forms and develop a program for a
digital computer that will compose item sentences given a suitable
vocabulary and structural codes for the item forms. An item form has
the following characteristics: (1) it generates items with a fixed
structure; (2) it contains one or more variable elements; and (3) it
defines a class of item sentences by specifying the replacement sets
for the variable elements. An item form may be very general or
abstract or quite specific and particular. The analysis of a content
domain into item forms proceeds from the general to specific in much
the same way as an ordinary subject matter outline with one crucial
difference - in item forms analysis there is an unbroken link between
the abstract system and the individual item sentence. This property
makes it possible to unambiguously define a universe of content as a
hierarchical arrangement of item forms together with the replacement
sets for the variable elements.
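
The three characteristics of an item form can be illustrated with a minimal
modern sketch in Python. It is not the program used in this study, and the
item form text, the replacement sets and the function names are all
hypothetical. It shows how a fixed structure with variable elements and
their replacement sets defines a countable class of item sentences from
which one sentence can be drawn at random.

```python
import itertools
import random

# Hypothetical item form: a fixed structure with two variable elements.
ITEM_FORM = "Compute the {statistic} of the following scores: {scores}."

# Hypothetical replacement sets for the variable elements.
REPLACEMENT_SETS = {
    "statistic": ["mean", "median", "standard deviation"],
    "scores": ["2, 4, 6, 8", "1, 3, 3, 7, 9"],
}


def universe(form, sets):
    """Enumerate every item sentence that the item form defines."""
    names = sorted(sets)
    for combo in itertools.product(*(sets[name] for name in names)):
        yield form.format(**dict(zip(names, combo)))


def random_item(form, sets, rng):
    """Draw one item sentence by sampling each replacement set once."""
    chosen = {name: rng.choice(values) for name, values in sets.items()}
    return form.format(**chosen)


all_items = list(universe(ITEM_FORM, REPLACEMENT_SETS))
sample = random_item(ITEM_FORM, REPLACEMENT_SETS, random.Random(0))
```

Because the universe is enumerable, any sampled item sentence can be traced
back to its item form and replacement sets, which is the unbroken link
described above.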



B. Purpose 

The primary purpose of this study was to try out the concept of
computer generated test items in the context of an actual course of
instruction to determine the operational feasibility of the technique.
The statistical characteristics of computer generated items as compared
with instructor made items and student reactions to the computer
items were the principal criteria used to assess feasibility along with
experience in attempting to actually implement the procedure.

Chapter 2

Method and Procedures 



A. The Computer Item Generating Program 

During the summer of 1967 a computer program was developed by 
David Shoemaker and the author for generating test items using the 
item form concept. The program was multi-purpose in the sense that 
(1) it could accept as data the raw material for item forms; (2) it 
could stratify item forms into classes or strata for sampling pur- 
poses; and (3) it could generate random item sentences according to 
the sampling plan specified by the investigator. The program was in 
block form in the sense that the various phases of the item genera- 
tion process were independent of each other and could be initiated 
by means of a control card. The process of item generation was 
broken down into the following phases: 

1. Coding of Replacement Sets - Replacement sets for item forms
were inputted as character data. The computer program coded the 
replacement set in such a way that the set could be referenced and 
an element of the set could be randomly selected as needed. 

2. Random Number and Frequency Distributions - The program provided
for the generation of several types of random numbers, frequency
distributions, probability distributions and joint distributions.
The program operated on code read in as part of an item form or random
replacement set. The code specified the desired characteristics
of the random number, frequency distribution, etc.

3. Coding of Item Forms - Item forms were inputted as character
data, coded references to random replacement sets, and coded references
to random numbers. The computer coded the item forms in such
a way that the item form could be referenced by number and the computer
could assemble the various elements of an item form and print
out a particular item sentence.

4. Stratification of Item Forms - Stratification of item forms
was accomplished by inputting item form code numbers referenced to
the desired strata. Thus, stratification could be modified by data
input.

5. Generation of Tests - The computer program generated tests by
selecting one random item sentence from each stratum referenced
by the input command. If more than one item per stratum was desired,
the stratum was referenced multiple times.
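
The five phases above can be sketched in a few lines of modern Python. This
is an illustrative sketch only and not the FORTRAN IV program described in
this report. The replacement set, item forms, strata names and number
ranges are hypothetical, and a seeded random generator stands in for the
program's random number codes.

```python
import random

# Phases 1 and 3: replacement sets and item forms stored by code
# (all contents are hypothetical).
REPLACEMENT_SETS = {
    "R1": ["reaction times", "test scores", "error counts"],
}
ITEM_FORMS = {
    1: "Find the mean of {n} {R1}: {data}.",
    2: "Find the variance of {n} {R1}: {data}.",
    3: "Compute z for X = {x}, mean = {m}, SD = {s}.",
}

# Phase 4: strata are lists of item form code numbers, so the
# stratification can be changed by editing data rather than code.
STRATA = {"descriptive": [1, 2], "standard_scores": [3]}


def generate_item(form_no, rng):
    """Phases 2 and 3: fill one item form with random numbers and set elements."""
    n = rng.randint(4, 6)
    values = {
        "n": n,
        "R1": rng.choice(REPLACEMENT_SETS["R1"]),
        "data": ", ".join(str(rng.randint(1, 20)) for _ in range(n)),
        "x": rng.randint(40, 60),
        "m": 50,
        "s": rng.choice([5, 10]),
    }
    # str.format ignores keyword values that a given item form does not use.
    return ITEM_FORMS[form_no].format(**values)


def generate_test(strata_refs, rng):
    """Phase 5: one random item sentence per stratum reference."""
    test = []
    for name in strata_refs:
        form_no = rng.choice(STRATA[name])
        test.append((name, generate_item(form_no, rng)))
    return test


rng = random.Random(1968)
# Referencing "descriptive" twice yields two items from that stratum.
test = generate_test(["descriptive", "descriptive", "standard_scores"], rng)
```

Listing a stratum more than once in the request mirrors the multiple
referencing described in phase 5.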

The program was written in FORTRAN IV compatible with the Sigma
7 and the 7090 series computers. Four tapes were utilized for data
storage. Data for about 100 item forms could be stored and processed
in one pass through the computer. The user's manual describing the
control cards and the various random number and format codes is
presented in Appendix E of this report. The program statements are
presented in Appendix F.



B. Development of Item Forms 

The first stage in the development of the item forms used in this
study was to construct a behavior list covering significant tasks that
the competent student should be able to perform correctly. The scope
of the behavior list was roughly equivalent to chapters 1-10, 16 and
17 in Statistics for Psychologists by William L. Hays. The behavior
list represents a rather molar analysis of the chosen topics in elementary
statistics and assumes that the student has access to a text and
class notes. As it turned out many of the items on the behavior list
were not applicable to the classes in elementary statistics on which
data were collected. The text actually used in the experimental
classes was Fundamental Statistics in Psychology and Education by J. P.
Guilford. For this reason many of the items on the behavior list
were not used in the present study. The behavior list is presented
in Appendix A of this report.



Item forms were developed by taking each element of the behavior 
list and attempting to define one or more item forms for the behavior 
element. The item forms that emerged were heavily computationally 
oriented. This was partly due to the open book type of examination for 
which the item forms were designed, and partly due to the bias of the 
author. A writing team would be required to develop a really compre- 
hensive set of item forms. The objective of this pilot study was to 
evaluate the feasibility of the procedure rather than develop a com- 
prehensive set of item forms. 

One other characteristic of the item forms used in this study was 
that they were not completely specified as to content. Only the general 
structure was specified and the actual content of the item form was 
developed as it was composed for computer input. The item form list is 
presented in Appendix B of this report. One random item sentence 
from each item form is presented in Appendix G. 



C. Experimental Tryout of Item Forms 
1. Samples 

Two samples were used in this study. The first sample consisted
of 27 students in a senior level course in elementary statistics
for psychologists at the University of Houston during the fall semester
of 1967. The text for the course was Fundamental Statistics in Psychology
and Education by J. P. Guilford. The course was taught by the
author. The general characteristics of the sample were as follows:
Of the 27 students 78% (21) were taking their first statistics course.
While the majority of the students were psychology majors (14), a
wide variety of majors were represented: biology, math, speech,
economics and English. The mean age of the sample was 24.7 years (SD = 4.8)
and 67% (18) were male. A majority were undergraduates (16).

A second sample of students taking the same course was
studied during the spring semester of 1968. The instructor and text
were the same as for the first sample. The characteristics of the
second sample were as follows: Of the 21 students 86% were taking
their first course in statistics. Only about one-fourth of the students
were psychology majors with a wide variety of majors other than
psychology represented. The mean age of the sample was 24.48 years and 71%
(15) were male. Thirteen were undergraduates and 8 were graduate students.

2. Experimental Tests

Three tests were administered to the fall-1967 sample. For
comparison purposes the tests were composed of a mixture of computer
generated and instructor made items. The item composition of each test
is presented in Table 1. It is important to note that the computer
generated items were not truly random item sentences as some selection
among computer generated items was required due to difficulties with
the computer program. It can be said that the computer generated items
were representative but not truly randomly sampled. All tests used
in the study are presented in Appendix C of this report.

Since it was necessary to terminate the study prior to the
end of the spring semester 1968, only two experimental tests were
studied for the spring-1968 sample. The composition of these two tests
is also presented in Table 1.

One to two weeks prior to each test, samples of two random
item sentences from each item form that could appear on the test were
passed out to the students as study guides. The students were told
that some of the items on the forthcoming test would be randomly sampled
from the same universe of content as the sample items. It was made
clear that in all probability exact duplicates of the sample items
would not appear on the test.



3. The Student Questionnaire 

A questionnaire was constructed for the purpose of assessing 
student reaction to the computer generated items. This questionnaire 
was given to the fall-1967 sample just after the final examination in 
the course. It was given to the spring-1968 sample about one week after
the second examination. A copy of the student questionnaire is in 
Appendix D of this report. 










Chapter 3 

Results, Discussions and Conclusions 

A. Results on the Fall-1967 Sample

1. Statistical Characteristics of Computer Items

The three tests given to the fall-1967 sample contained a
mixture of instructor made and computer generated items so that
comparisons could be made. It should be emphasized that these comparisons
are in no way definitive since the instructor made items were
arbitrary and the computer program from time to time generated defective
items so that these items were not truly randomly sampled.
Nevertheless, a rough idea of the statistical characteristics of the
computer items can be obtained while recognizing the limitations of the
study.

The item means, standard deviations and intercorrelations
for the three tests are presented in Tables 2a, 2b and 2c. Inspection
of these tables shows that the items within a particular test were
moderately intercorrelated with test 2 having the most homogeneous
items. Also the item-total score correlations are in the expected
range with the exception of two items (item 5 in test 1 and item 2
in the final examination). Both of these items were computer generated.
The suggestion from these data is that the computer generated
items may be a little less homogeneous than the instructor made items.
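
For readers unfamiliar with the statistic, an item-total score correlation
can be computed as in the following Python sketch. The score matrix is
hypothetical and is not taken from Tables 2a, 2b or 2c. The sketch also
assumes the uncorrected form in which each item is included in the total,
since the report does not state which form was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(scores):
    """scores[i][j] is the score of examinee i on item j.

    Returns one correlation per item between the item score and the
    total score (assumption: the item itself is included in the total).
    """
    totals = [sum(row) for row in scores]
    n_items = len(scores[0])
    return [pearson([row[j] for row in scores], totals) for j in range(n_items)]


# Hypothetical scores for 5 examinees on 3 items.
SCORES = [
    [2, 3, 1],
    [4, 4, 0],
    [5, 5, 2],
    [1, 2, 3],
    [3, 4, 1],
]
r = item_total_correlations(SCORES)
```

An item whose correlation with the total falls well below the others, like
the third item in these made-up data, is the kind of item flagged in the
paragraph above as less homogeneous.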

The overall results are presented in Table 3. These data
show that the computer generated items were slightly less reliable per
item than the instructor made items. This is also reflected in the
slightly lower average item-sum score correlations for the computer
generated items. Test 1, consisting of computer items only, showed the
lowest reliability. Thus the weight of the evidence points to a slightly
lower reliability for the computer items. On the other hand the
differences are slight, suggesting that the price in lower reliability that

Table 1

Item Composition of Tests

Fall-1967 Sample

Item Classification    Test 1    Test 2    Final    Total
Instructor Items          0         4        4        8
Computer Items            7         5        6       18
Total                     7         9       10       26

Spring-1968 Sample

Item Classification    Test 1    Test 2    Total
Instructor Items          3         1        4
Computer Items            5         6       11
Total                     8         7       15

C - Computer-generated item; I - Instructor-made item















